
Conversation

@kebe7jun (Contributor) commented Aug 27, 2025

Purpose

Add streaming support for non-harmony models in the Responses API.

Related issue #23225

Test Plan

Unit tests and self-tests (see results below).

Test Result

GPT-OSS Stream output
ResponseCreatedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=0, delta='User', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' wants', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' us', item_id='', output_index=0, sequence_number=6, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=0, delta=' but', item_id='', output_index=0, sequence_number=110, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' okay', item_id='', output_index=0, sequence_number=111, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta='.', item_id='', output_index=0, sequence_number=112, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=0, item_id='', output_index=1, sequence_number=113, text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=1, sequence_number=114, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=1, sequence_number=115, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=116, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='double', item_id='', logprobs=[], output_index=1, sequence_number=117, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=118, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=145, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bath', item_id='', logprobs=[], output_index=1, sequence_number=146, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=0, item_id='', logprobs=[], output_index=2, sequence_number=147, text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='', output_index=2, part=ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None), sequence_number=148, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None)], role='assistant', status='completed', type='message'), output_index=2, sequence_number=149, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=81, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=149, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=230), user=None), sequence_number=150, type='response.completed')
Qwen3 30B A3B Stream output
ResponseCreatedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=1, delta='\n', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=2, delta='Okay', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=256, delta='.\n', item_id='', output_index=0, sequence_number=259, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=257, item_id='', output_index=0, sequence_number=260, text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=0, sequence_number=261, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=262, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=263, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=1, delta='\n\n', item_id='', logprobs=[], output_index=1, sequence_number=264, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=2, delta='Double', item_id='', logprobs=[], output_index=1, sequence_number=265, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=42, delta='', item_id='', logprobs=[], output_index=1, sequence_number=305, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=43, item_id='', logprobs=[], output_index=1, sequence_number=306, text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=44, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None), sequence_number=307, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None)], role='assistant', status='completed', type='message', summary=[]), output_index=1, sequence_number=308, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=18, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=300, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=318), user=None), sequence_number=309, type='response.completed')
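Both dumps above share one invariant worth calling out: `sequence_number` starts at 0 and increases by exactly one for every event in the stream, across item boundaries. The sketch below is a hypothetical helper (not part of this PR) that checks that invariant on a list of parsed events:

```python
# Hypothetical helper, not part of the vLLM PR: verifies the invariant visible
# in both stream dumps above -- sequence_number counts 0, 1, 2, ... with no
# gaps or repeats across the whole response stream.

def check_sequence_numbers(events: list[dict]) -> bool:
    """Return True if events are numbered 0, 1, 2, ... with no gaps."""
    return all(e["sequence_number"] == i for i, e in enumerate(events))

events = [
    {"type": "response.created", "sequence_number": 0},
    {"type": "response.in_progress", "sequence_number": 1},
    {"type": "response.output_item.added", "sequence_number": 2},
    {"type": "response.completed", "sequence_number": 3},
]
print(check_sequence_numbers(events))  # True
print(check_sequence_numbers(events[::-1]))  # False
```

A check like this is a cheap unit-test assertion for any streaming handler, since clients may rely on the numbering to detect dropped events.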


@kebe7jun force-pushed the feature/responses-api-streaming branch from 8dc2da4 to b65638e on August 27, 2025 11:32
@mergify bot added the frontend label Aug 27, 2025
@kebe7jun force-pushed the feature/responses-api-streaming branch from b65638e to 3bb6902 on August 27, 2025 11:46
@mergify bot added the v1 label Aug 27, 2025
@kebe7jun marked this pull request as ready for review August 27, 2025 11:55
@kebe7jun requested a review from aarnphm as a code owner August 27, 2025 11:55
@kebe7jun force-pushed the feature/responses-api-streaming branch 2 times, most recently from 6d9fe9c to af25d9a, on August 28, 2025 01:37
@kebe7jun (Contributor, Author) commented:

@heheda12345 PTAL

@heheda12345 (Collaborator) left a comment:

Thanks for your contribution. Some small comments.

) -> AsyncGenerator[str, None]:
sequence_number = 0
current_content_index = 0 # FIXME: this number is never changed
@heheda12345 (Collaborator):

Can you fix these indexes? Reference: #23382

@kebe7jun (Contributor, Author):

fixed.
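For context on what "fixing the indexes" entails: in the Responses API, `output_index` advances per output item, `content_index` restarts within each item, and `sequence_number` ticks once per emitted event. The class below is a hypothetical sketch of that bookkeeping; the names mirror the PR's locals, but this is not the vLLM implementation:

```python
# Hypothetical index bookkeeping for a Responses API streaming handler.
# Not the vLLM code -- just an illustration of the relationship between
# the three counters the review thread discusses.

class StreamIndices:
    def __init__(self) -> None:
        self.sequence_number = -1
        self.output_index = -1
        self.content_index = -1

    def next_event(self) -> int:
        """Every emitted event consumes the next sequence number."""
        self.sequence_number += 1
        return self.sequence_number

    def new_output_item(self) -> int:
        """A new output item advances output_index and resets content parts."""
        self.output_index += 1
        self.content_index = -1
        return self.output_index

    def new_content_part(self) -> int:
        """Content parts are numbered within their enclosing item."""
        self.content_index += 1
        return self.content_index

idx = StreamIndices()
idx.new_output_item()   # reasoning item -> output_index 0
idx.new_content_part()  # its text part  -> content_index 0
idx.new_output_item()   # message item   -> output_index 1
idx.new_content_part()  # its text part  -> content_index 0 again
print(idx.output_index, idx.content_index)  # 1 0
```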

@kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77bd0aa to bc4c5ae, on September 3, 2025 03:18
@@ -864,7 +861,7 @@ async def _process_simple_streaming_events(
     created_time: int,
     _send_event: Callable[[BaseModel], str],
 ) -> AsyncGenerator[str, None]:
-    current_content_index = 0  # FIXME: this number is never changed
+    current_content_index = 0
     current_output_index = 0
     current_item_id = ""  # FIXME: this number is never changed
@heheda12345 (Collaborator):

Thanks for the quick update. Can you also update the `current_item_id`?

@kebe7jun (Contributor, Author):

Thank you for the reminder, my apologies for the oversight; fixed.

@kebe7jun force-pushed the feature/responses-api-streaming branch from bc4c5ae to cf993d1 on September 3, 2025 07:51
@heheda12345 (Collaborator) left a comment:

LGTM! Thanks for your contribution.

@heheda12345 enabled auto-merge (squash) September 3, 2025 18:11
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 3, 2025
@heheda12345 (Collaborator) commented:

@kebe7jun The v1-test-entrypoints CI failure seems to be related to this PR. Can you take a look?

 v1/entrypoints/openai/responses/test_basic.py::test_streaming - TypeError: 'AsyncStream' object is not iterable
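This `TypeError` is the classic symptom of consuming an async iterator with a synchronous `for` loop: the OpenAI client's `AsyncStream` implements `__aiter__`/`__anext__` but not `__iter__`, so it must be iterated with `async for`. A minimal self-contained reproduction with a stand-in stream class (`FakeAsyncStream` is hypothetical, used here only to mimic the behavior):

```python
# Minimal reproduction of the CI failure: an async-iterable-only object
# raises TypeError under a plain `for` loop and must be consumed with
# `async for`. FakeAsyncStream is a stand-in, not the openai class.
import asyncio

class FakeAsyncStream:
    """Async-iterable only, like openai's AsyncStream."""
    def __init__(self, events):
        self._events = events

    def __aiter__(self):
        self._it = iter(self._events)
        return self

    async def __anext__(self):
        try:
            return next(self._it)
        except StopIteration:
            raise StopAsyncIteration

async def consume(stream):
    received = []
    async for event in stream:  # correct: async for, not for
        received.append(event)
    return received

events = asyncio.run(consume(FakeAsyncStream(["created", "delta", "done"])))
print(events)  # ['created', 'delta', 'done']

# By contrast, `for event in FakeAsyncStream([...])` raises:
# TypeError: 'FakeAsyncStream' object is not iterable
```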

auto-merge was automatically disabled September 4, 2025 01:14

Head branch was pushed to by a user without write access

@kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77ef2de to 30d435e, on September 4, 2025 04:37
@kebe7jun force-pushed the feature/responses-api-streaming branch from 30d435e to 3e604da on September 4, 2025 04:38
@DarkLight1337 merged commit 8f423e5 into vllm-project:main Sep 4, 2025
39 checks passed
@kebe7jun deleted the feature/responses-api-streaming branch September 4, 2025 09:49
JasonZhu1313 pushed a commit to JasonZhu1313/vllm that referenced this pull request Sep 7, 2025